
Unsupervised learning of probabilistic grammars



Abstract

Probabilistic grammars define joint probability distributions over sentences and their grammatical structures. They have been used in many areas, such as natural language processing, bioinformatics, and pattern recognition, mainly for the purpose of deriving grammatical structures from data (sentences). Unsupervised approaches to learning probabilistic grammars induce a grammar from unannotated sentences, eliminating the need for manual annotation of grammatical structures, which can be laborious and error-prone. In this thesis we study unsupervised learning of probabilistic context-free grammars and probabilistic dependency grammars, both of which are expressive enough for many real-world languages while remaining tractable for inference. We investigate three different approaches.

The first approach is a structure search approach for learning probabilistic context-free grammars. It acquires the rules of an unknown probabilistic context-free grammar through iterative coherent biclustering of the bigrams in the training corpus. A greedy procedure adds rules derived from biclusters, at each step choosing the set of rules that yields the largest increase in the posterior of the grammar given the training corpus. Our experiments on several benchmark datasets show that this approach is competitive with existing methods for unsupervised learning of context-free grammars.

The second approach is a parameter learning approach for learning natural language grammars based on the idea of unambiguity regularization. We make the observation that natural language is remarkably unambiguous, in the sense that each sentence admits a large number of possible parses but only a few of them are syntactically valid. We incorporate this prior information into parameter learning by means of posterior regularization. The resulting algorithm family contains classic EM and Viterbi EM, as well as a novel softmax-EM algorithm that can be implemented as a simple and efficient extension of classic EM. Our experiments show that unambiguity regularization improves natural language grammar learning, and that when combined with other techniques our approach achieves state-of-the-art grammar learning results.

The third approach is grammar learning with a curriculum, a means of presenting training samples in a meaningful order. We introduce the incremental construction hypothesis, which explains the benefits of a curriculum in learning grammars and offers useful insights into the design of both curricula and learning algorithms. We present results of experiments with (a) carefully crafted synthetic data that support our hypothesis and (b) a natural language corpus that demonstrates the utility of curricula in unsupervised learning of real-world probabilistic grammars.
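To make the structure-search loop concrete, the following minimal Python sketch (not code from the thesis) shows greedy rule addition driven by posterior gain. The helpers propose_rule_sets and grammar_posterior, and the method grammar.with_rules, are hypothetical stand-ins for the coherent-biclustering step and the computation of the grammar posterior given the corpus.

def greedy_structure_search(corpus, grammar, propose_rule_sets, grammar_posterior):
    # Greedily extend the grammar with rule sets derived from biclusters
    # of corpus bigrams, keeping an addition only when it increases the
    # posterior P(grammar | corpus). All helpers are hypothetical.
    best_score = grammar_posterior(grammar, corpus)
    improved = True
    while improved:
        improved = False
        best_rules = None
        for rules in propose_rule_sets(corpus, grammar):
            candidate = grammar.with_rules(rules)  # hypothetical copy-and-extend
            score = grammar_posterior(candidate, corpus)
            if score > best_score:
                best_score, best_rules, improved = score, rules, True
        if improved:
            grammar = grammar.with_rules(best_rules)
    return grammar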
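For the second approach, the sketch below illustrates the annealed E-step of softmax-EM, assuming the posterior-regularized update takes the closed form q(parse) ∝ p(parse | sentence)^(1/(1-sigma)), which interpolates between classic EM (sigma = 0) and Viterbi EM (sigma → 1). Enumerating parse posteriors explicitly is only for illustration; a practical implementation would fold the exponent into inside-outside dynamic programming.

import numpy as np

def softmax_e_step(parse_posteriors, sigma):
    # Annealed E-step: q(parse) ∝ p(parse | sentence)^(1 / (1 - sigma)).
    # sigma = 0 leaves the posteriors unchanged (classic EM); sigma -> 1
    # approaches Viterbi EM, which puts all mass on the single best parse.
    p = np.asarray(parse_posteriors, dtype=float)
    if sigma >= 1.0:
        q = np.zeros_like(p)      # Viterbi EM limit
        q[np.argmax(p)] = 1.0
        return q
    q = p ** (1.0 / (1.0 - sigma))
    return q / q.sum()

For example, softmax_e_step([0.5, 0.3, 0.2], sigma=0.5) squares the posteriors and renormalizes to roughly [0.66, 0.24, 0.11], biasing the expected counts toward the dominant parse; this is how the unambiguity prior enters parameter learning.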
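Finally, a curriculum only requires an ordering of the training data and a schedule for exposing it to the learner. This minimal sketch uses sentence length as an illustrative difficulty measure (a common choice in grammar induction, not necessarily the thesis's) and trains on a growing prefix of the ordered corpus; em_update is a hypothetical one-pass parameter update.

def curriculum_train(sentences, model, em_update, n_stages=5):
    # Present training samples in a meaningful order: shorter (easier)
    # sentences first, then progressively expose the full corpus.
    ordered = sorted(sentences, key=len)  # illustrative difficulty measure
    for stage in range(1, n_stages + 1):
        prefix = ordered[: stage * len(ordered) // n_stages]
        model = em_update(model, prefix)  # hypothetical parameter update
    return model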

Bibliographic details

  • Author: Tu, Kewei
  • Affiliation:
  • Year: 2012
  • Pages:
  • Format: PDF
  • Language: en
  • Classification:
